Scheduling system for computational work on heterogeneous hardware

ABSTRACT

The technology includes methods, processes, and systems for virtualizing graphics processing unit (GPU) memory. Example embodiments of the technology include managing an amount of GPU memory used by one or more processes, such as Application Programming Interfaces (APIs), that directly or indirectly impact one or more other processes running on the same GPU. Managing and/or virtualizing the amount of GPU memory may ensure that an end user does not receive a GPU out-of-memory error because the API request is impacted by the processing of other API requests. A virtual machine with access to a GPU may be organized with one or more job slots that are configured to specify the number of processes that are able to run concurrently on a specific virtual machine. A process may be configured on each virtual machine running a software program or API and is used to schedule work based on GPU memory requirements.

CROSS-REFERENCE TO RELATED APPLICATIONS

This Non-Provisional Patent Application claims the right of priority to and the benefit of pending U.S. Non-Provisional patent application Ser. No. 16/136,075, filed Sep. 19, 2018, entitled “SCHEDULING SYSTEM FOR APPLICATION PROGRAMMING INTERFACES ON HETEROGENEOUS HARDWARE” (Client Reference No. 065-001US1), which claims the benefit of earlier filing date and right of priority to U.S. Provisional Patent Application Serial No. 62/561,190, filed on Sep. 20, 2017, entitled “SCHEDULING SYSTEM FOR APPLICATION PROGRAMMING INTERFACES ON HETEROGENEOUS HARDWARE” (Client Reference No. 065-001PR0), of which the specification, claims, and figures thereof are incorporated herein by reference in their entireties.

BACKGROUND

Requests, such as program requests, application requests, application programming interface (API) requests, and the like, which require a machine with access to a graphics processing unit (GPU) to process the request, can use varying amounts of memory at various times in their lifecycle. Generally, memory restrictions are not put on machines with access to GPUs, which can lead to requests using too many resources, such as GPU memory, and interfering with other users or other requests. GPU memory usage can be a problem in multiple ways, such as while a machine is actively handling a request or after the request has completed because GPU memory is not freed.

Currently, GPU memory is not managed or virtualized in the way CPU memory is managed, so the amount of GPU memory taken by one process directly impacts another process. It is difficult to enforce an actual limit on the amount of resources, such as GPU memory, that a request can be allowed to use. Lack of enforcement can cause problems where a system has more than one user running API requests at a time; such problems may include efficiency or scheduling problems when more resources are requested than are available.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing will be apparent from the following more particular description of example embodiments of the technology, as illustrated in the accompanying drawings. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

FIG. 1 illustrates an example computing environment in accordance with some embodiments of the subject technology;

FIG. 2 illustrates an example computing environment in accordance with some embodiments of the subject technology;

FIG. 3 illustrates an example computing environment in accordance with some embodiments of the subject technology;

FIG. 4 conceptually illustrates a virtual machine, such as the virtual machine shown in FIGS. 1-3, in accordance with some embodiments of the subject technology;

FIG. 5A is an illustrative example of a state diagram that specifies a few of the different possible states of slots in a host in accordance with at least one embodiment shown;

FIG. 5B is an illustrative example of a graph showing GPU memory over time in a host machine in accordance with at least one embodiment shown;

FIG. 6 illustrates an example process for performing process request allocation to a host that can be utilized in accordance with various embodiments;

FIG. 7 illustrates an example process for performing process request allocation of persistent slots on a host that can be utilized in accordance with various embodiments;

FIG. 8 is a block diagram illustrating exemplary components of a sample GPU in accordance with an embodiment;

FIG. 9 illustrates example components of a client computing device in accordance with various embodiments; and

FIG. 10 illustrates an environment in which various embodiments can be implemented.

DETAILED DESCRIPTION

In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.

Techniques described and suggested herein include methods, processes, and systems for virtualizing graphics processing unit (GPU) memory. Generally, GPU memory is not managed, so the amount of memory taken by one process directly impacts another process. If a single process, such as an application programming interface (API) request to process user data according to an algorithm, is actively running/being executed on a GPU, no other processes may be run on that same GPU, regardless of the amount of GPU memory being used by the process. A GPU according to example embodiments presented herein may be configured to load one or more processes at the same time.

Example embodiments include a method, process, system, and computer-readable medium to implement one or more memory virtualization layers for a parallel computing platform and application programming interface (API) model. The parallel computing platform may be a software layer configured to provide access, such as direct access, to a graphics processing unit (GPU) for general purpose processing, also referred to as General-Purpose Computing on Graphics Processing Units (GPGPU). GPGPU generally refers to the use of a GPU configured to perform computations in applications traditionally performed by a central processing unit (CPU).

A GPU may be configured as an integrated component with a CPU on a same circuit, graphics card, and/or motherboard of a physical endpoint device, such as a computer, computing device, or server. Example embodiments presented herein may also refer to CPUs and GPUs generally, where such units may be a virtual CPU and/or a virtual GPU. A virtual GPU is generally configured to render graphics on a virtual desktop via a virtual machine, where graphical instructions may be relayed via a proxy, such as a hypervisor, from a virtual desktop to a physical GPU. For example, a virtual GPU is a computer processor configured to render graphics on a server, such as a host server, of a virtual machine, as opposed to rendering the graphics on a physical device, such as a physical hardware device. A virtual GPU may be configured to offload graphics processing power from a server CPU in a virtual desktop infrastructure. However, GPUs and virtual GPUs may be configured and/or enabled to perform highly parallelizable, complex calculations and determinations that are currently not considered, and which are described herein. A virtual CPU, also referred to as a virtual processor, may include a physical CPU assigned to a virtual machine. While example embodiments presented herein discuss applications with reference to a virtual machine for simplicity and consistency, some or all of the embodiments may be performed on hardware or circuits of a physical endpoint device.

Typically, a GPU (example depicted in FIG. 8), which is a specialized computer processor, performs computations for computer graphics, such as real-time or near real-time graphics compute-intensive processes, or other highly parallelizable computations, such as neural networks, machine learning systems, and/or general artificial intelligence. For purposes of simplicity in explanations herein, example embodiments will be described with reference to API requests; however, it will be understood by those having ordinary skill in the art that example embodiments and implementations of the processes, systems, and non-transitory computer-readable mediums can further be used with and/or for mathematical calculations, complex algorithms, intensive applications, and other processes. For example, a GPU may be configured to perform algorithmic and/or mathematical calculations for computation-intensive applications that may otherwise strain a CPU and/or degrade performance of such calculations, results, or outcomes of the computed process.

FIG. 1 illustrates an environment 100 in which a user 101 may transmit a request to be processed according to one or more example embodiments.

API requests that may use or require GPUs can use varying amounts of resources, such as GPU memory, at various times in the lifecycle of the API. If no restrictions are placed on the API usage of the GPU, it can lead to APIs using too many resources and interfering with other APIs from the same or other users. For example, APIs may persist across multiple requests, and the memory usage may vary during different stages of the API lifecycle.

Example embodiments include managing an amount of resources, such as GPU memory, used by one or more processes, such as Application Programming Interfaces (APIs), that directly or indirectly impact one or more different/other processes. Managing and/or virtualizing the amount of GPU memory may ensure that an end user, such as a user transmitting an API request that is processed via one or more GPUs, does not receive a GPU out-of-memory error because the API request is impacted by the processing of one or more other API requests.

In the example embodiment depicted in FIG. 1, a user 101 may transmit an API request 111 through an application executed by the user's client 102, such as a user's computing device. An API server 106 may include a scheduler module 112 configured to initialize one or more processes on a single host or virtual machine and is further configured to route traffic to a specific slot on a specific host, e.g., a specific virtual machine, such as VM 108. The scheduler module 112, which can be a software agent, daemon, processor, or the like, may communicate with one or more API servers (e.g., one or more virtual machines/servers configured to provision the virtual machines) and/or be operably connected to a scheduling database 125 for use in scheduling API requests.

A computer program or software agent, such as a background-running process or daemon process configured to run on a machine or virtual machine being monitored, may be configured to run on one or more virtual machines (also referred to as “workers” or “hosts”). Using a worker daemon 120, one or more processes may be configured to route user traffic, such as API request 111, to specific slots 114 a-d running on the virtual machine 108. The worker daemon 120 may be further configured to initialize, start, and/or stop slots on the VM. The worker daemon may be operably connected with a scheduler 112, such as a process to schedule traffic or determine the scheduling of API requests received from one or more API servers. The scheduler module 112 may be directly connected to the worker daemon 120 or be operably connected via the API server 106.

In one example embodiment, when an API request 111 is received at an API server 106, a scheduler module 112 connected to the API server 106 determines, based at least in part on information provided in the request 111, information from the worker daemon 120, and/or information from the scheduling database 125, on which virtual machine the API request will be processed. For example, the API server may assign the API request to be run in a specific slot of a specific virtual machine (VM).

The specified VM executes the API request and provides it with access to the GPU if necessary, according to information related to or from the API, API request, database, or other source of information/input/data, or if the request is otherwise determined to be one to be used with the GPU. The worker daemon 120 for the specified VM 108 maintains information related to the accessed GPU 115, such as how much GPU memory is available at that specific moment in time. The worker daemon may further have access to information, via the scheduling database and/or API database, such as when additional GPU memory will become available. The worker daemon for the specified VM may further calculate the available GPU memory based, at least in part, on currently running, loading, and/or holding APIs also present in different slots of the specified VM (see FIG. 5B).

The scheduler module 112 may further be configured to create a map associating each API request 111 and a process identifier (PID) (not shown), which may identify the process running the specific API request on the host 108. In some example embodiments, no maps may be created, or only a set or subset of PIDs may be created in a map. The PID may be used to allow processes to be manipulated, such as adjusting the API request's priority, changing the status of the API request, terminating the API request, or the like. For example, when an API request is executed in a certain slot, the PID for that slot is known by the worker daemon. An API request, such as API request 111, may include the API to be run on the system, a model file, user account information, an Internet Protocol (IP) address, data/input to be used, an API version, a memory expectation, and/or other data that is related to processing the API request. Such information, whether received in an API request or from other requests, may be stored in the scheduling database 125. A scheduling database 125 may be operably connected with the worker daemon 120 and/or the scheduler module 112 to store information related to the GPU 115, CPU 117, RAM 119, and/or additional information related to VM 108.
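
As a non-limiting illustration, such a map could be as simple as a dictionary keyed by request identifier. The following Python sketch assumes a Unix-like host; the class and method names (RequestTracker, register, set_priority, terminate) are hypothetical and not drawn from any embodiment.

    import os
    import signal

    class RequestTracker:
        """Hypothetical map associating API requests with host PIDs."""

        def __init__(self):
            self._pid_by_request = {}  # request id -> PID of the slot process

        def register(self, request_id, pid):
            # Record which process on the host is running the API request.
            self._pid_by_request[request_id] = pid

        def set_priority(self, request_id, niceness):
            # Adjust the priority of the process running the request.
            os.setpriority(os.PRIO_PROCESS, self._pid_by_request[request_id], niceness)

        def terminate(self, request_id):
            # Terminate the process running the request and drop the mapping.
            pid = self._pid_by_request.pop(request_id)
            os.kill(pid, signal.SIGTERM)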

The worker daemon 120 may further be configured to poll or monitor one or more GPUs 115 on a constant, intermittent, random, or determined schedule to determine a status of the one or more virtual machines with GPU memory and/or resources available. For example, a GPU status may include how much GPU memory is available, how much GPU memory is being used by other processes, what the GPU processor rate is, and other characteristics related to GPUs.
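
One possible way to implement such polling, sketched below in Python, uses the NVML bindings (pynvml) to read free and used memory for each visible GPU on a fixed interval; the interval, the print-based reporting, and the function name poll_gpus are illustrative assumptions rather than details of any embodiment.

    import time
    import pynvml  # NVIDIA Management Library bindings; one possible data source

    def poll_gpus(interval_seconds=5):
        """Report free/used memory and utilization for every visible GPU."""
        pynvml.nvmlInit()
        try:
            while True:
                for i in range(pynvml.nvmlDeviceGetCount()):
                    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                    # An embodiment would write this status to the scheduling
                    # database rather than print it.
                    print(f"gpu={i} free={mem.free} used={mem.used} util={util.gpu}%")
                time.sleep(interval_seconds)
        finally:
            pynvml.nvmlShutdown()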

The API server 106 may be operably interconnected to the scheduling database 125 or one or more other databases that maintain and update such information related to APIs. The API database may include information and statistics gained from other received requests, such as when a user creates a new API or algorithm, what language an algorithm is written in, what requirements an API request needs, the expected memory usage acquired over time, and other API-specific information.

The scheduler module 112 may be configured to measure an amount of memory used by an API. The scheduler module may measure an amount of memory used after loading the API, memory usage while an API is active in a slot, peak memory usage while the API is active (e.g., loading, loaded, running, working, etc.) in a slot, memory usage after an API request is complete, or other such measurements to determine an amount of available and/or used GPU memory. The memory usage may be recorded in the scheduling database to indicate which APIs require, need, and/or request which amounts of memory. The memory used by an API may also be determined based on the version of the API, such as a version of the API based on an update of the API, added features to the API, or other changes to the API that would change the version from the original API.

In further example embodiments, the API server 106 determines which slot to assign an API request 111 to by determining a slot “score.” A slot score may be determined according to whether the slot is currently processing another request, how long a used slot will be engaged, whether the slot is empty, whether the slot is loaded with some or all of the input or data needed for another API request, information received from the worker daemon such as the amount of GPU memory available to the slot, whether a slot can be reused if it is already loaded, or other information related to the slots on the VM. The API server 106, via the scheduler 112 or separately, may additionally determine attributes or requirements an API might have to influence selection of specific VMs or slots in a VM. For example, selection may be based at least in part on the type of file, whether an API is cached on any available or unavailable VMs, whether the API calls other APIs or sub-processes, and whether the sub-processes are currently loading, loaded, or cached on any available or unavailable VMs. A score may be calculated for a slot on the VM. In other example embodiments, a score may also be calculated for each slot on every VM and/or on a subset of VMs. It should be noted that scoring and slot determination may be made in other ways, including combinations, variants, and alternatives not provided herein.
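
A minimal sketch of such a scoring function follows, assuming a handful of slot attributes (free GPU memory, whether the slot is empty or busy, which API it has loaded) and arbitrary weights; the attribute names and weights are illustrative, not values taken from any embodiment.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Slot:
        free_gpu_memory: int                 # bytes available to this slot
        is_empty: bool
        busy: bool
        loaded_api: Optional[str]            # API currently loaded, if any
        estimated_seconds_remaining: float   # for a busy slot

    @dataclass
    class Request:
        api: str
        expected_gpu_memory: int

    def score_slot(slot: Slot, request: Request) -> float:
        """Higher score means a better fit for the request."""
        if slot.free_gpu_memory < request.expected_gpu_memory:
            return float("-inf")  # the slot cannot hold the request at all
        score = 0.0
        if slot.is_empty:
            score += 1.0
        if slot.loaded_api == request.api:
            score += 10.0         # reuse avoids paying the load cost again
        if slot.busy:
            score -= slot.estimated_seconds_remaining / 60.0
        # Prefer the tightest fit so roomy slots stay free for large requests.
        score -= (slot.free_gpu_memory - request.expected_gpu_memory) / max(slot.free_gpu_memory, 1)
        return score

    # best_slot = max(candidate_slots, key=lambda s: score_slot(s, request))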

In alternative example embodiments, in place of the API server 106 or the scheduler 112 assigning the API request to a specific slot, the slot can be selected based on resources used. For example, the worker daemon 120 for the selected virtual machine 108 can determine if the slots of the VM can be assigned. The number of slots on a VM can be immutable or variable, such that the worker daemon can determine if the VM is being under-utilized or over-utilized. Based on the use determination, slots may be added to or removed from the VM dependent, at least in part, upon information available to the worker daemon, such as the amount of GPU memory available. For example, if the worker daemon for the VM determines that the slots 114 a-d are only using a small amount of GPU memory, the VM can add more slots so more requests may be assigned to the VM to better use the GPU resources available to the VM. The slots may be immutable or variable in size; for example, all slots on a specific VM may be 1 GB each, or slots may vary in the amount of memory allotted to them. In some example embodiments, an API request may occupy multiple slots in a single VM. For example, if an API request needs 7 GB of GPU memory, the API request could be assigned to two slots of a VM, where each slot provides 3.5 GB of GPU memory.
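
The slot arithmetic in the last example can be written directly; a minimal sketch, assuming fixed-size slots:

    import math

    def slots_needed(required_gpu_memory_gb: float, slot_size_gb: float) -> int:
        """Number of fixed-size slots an API request must occupy."""
        return math.ceil(required_gpu_memory_gb / slot_size_gb)

    # The example above: a request needing 7 GB on a VM whose slots each
    # provide 3.5 GB of GPU memory occupies two slots.
    assert slots_needed(7.0, 3.5) == 2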

In some example embodiments, a database can record and monitor the status of an API call. For example, the database can determine that a certain API call is slow, so when an API takes a long time to load but takes a short time to run, the API can be maintained in a permanent or semi-permanent slot, referred to herein as a sticky slot or a persistent slot. The persistent slot may ensure that a container is constantly running for that particular API. The database, such as an analytics database, can be maintained to analyze the types of APIs that require persistent slots, the users of the persistent slots, and other related data about the API, such as load time, run time, memory usage, expected value, etc. In other example embodiments, an API server, also referred to as an API virtual machine, is configured to receive an API request from a user to get the requirements for the API request (e.g., GPU, language, memory needs, files, cached APIs, etc.).

FIG. 2 illustrates an example computing environment 200 in accordance with some embodiments of the subject technology.

If the Application Programming Interface (API) server 206, being operably interconnected to each worker daemon 220 of each VM 208 with access to a GPU 215, determines that the current slot on a specific VM to which the API request was allocated does not actually have enough GPU memory available to run the request without a failure, partial failure, error, etc., the API server may transfer, transmit, and/or assign the request to a different VM with more GPU memory available. When such a transfer is successful, the new assignment may be made without transmitting an error message to the user based on the possible failure of being in the original GPU VM slot without enough memory.

If a given slot ends up using more memory than expected, the worker daemon 220 will start rejecting API requests (to other APIs) and will begin transmitting messages or information to the API server 206 that slot X is using large amounts of GPU memory. If requests continue to be rescheduled on this worker, the API server will, at some point, see that slot X is using too much memory and evict it. If the API server 206 or worker daemon 220 determines or surmises that all API requests require 6 GB of memory, then failures would only happen if multiple API requests, which together use more than 6 GB, are starting at the same time. As soon as a VM acquires too much memory, new requests, such as API request #2 (211 b), will be rejected and a failure response 213 may be returned to the user. The API server may be configured to route around workers in such a state.

For example, two API requests are allocated to different slots of the same VM, where API request #1 (211 a) is assigned to slot 1 (214 a) of the VM 208 and API request #2 (211 b) is assigned to slot 2 (214 b) of the VM. If each of API request #1 and API request #2 is determined to require or request more than 50% of GPU memory, one or both of the requests must be reassigned. The worker daemon 220 may determine or approximate the GPU memory requirements of the requests based on, for example, an algorithm expectation determination provided in the API request or determined by the API server. If API request #1 (211 a) is being processed but API request #2 (211 b) has yet to be processed, API request #2 (211 b) can be rejected or transferred. The worker daemon for the VM assigned to API requests #1 and #2 may determine that API request #2 (211 b) can be transferred to a different VM with more GPU memory available, without notifying the user of API request #2 (211 b) that a failure would have occurred if API request #2 had run in the original VM.
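
The admission decision described in the last two paragraphs can be summarized in a few lines; the following sketch is illustrative, with an assumed 8 GB GPU and a simplified three-way outcome (run, transfer, fail).

    from dataclasses import dataclass

    @dataclass
    class PendingRequest:
        request_id: str
        expected_gpu_memory_gb: float
        started: bool = False  # has processing begun on this VM?

    def admit(request: PendingRequest, free_gb: float) -> str:
        """Action a worker daemon might take for a request on this VM."""
        if request.expected_gpu_memory_gb <= free_gb:
            return "run"       # enough headroom on this VM
        if not request.started:
            return "transfer"  # quietly reschedule to a VM with more memory
        return "fail"          # already executing here; surface an error

    # Two requests that each need more than half of an 8 GB GPU cannot share it:
    assert admit(PendingRequest("r1", 5.0), free_gb=8.0) == "run"
    assert admit(PendingRequest("r2", 5.0), free_gb=3.0) == "transfer"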

The API server 206 or scheduler module 212 may be configured to provide request optimization for requests from one or more clients 202 a-b. Optimization, if two or more requests, such as a first request from a first client and a second request from a second client, are received at the API server 206 at or around the same time, may include using information about the request, the API, the input in the request, or other information related to memory usage. In other example embodiments, the requests may be from the same client, contain the same input for the different requests, or other combinations. There may be any number of requests, clients, users, and/or inputs received. For example, if it is known that a first request runs for a certain amount of time, then the second request could be placed in a queue behind the first request to use the same slot in which the first request is loaded. This is because it may take less time to queue the second request than to spin up a new slot and/or a new VM, load the second request, and run the second request. For example, if it takes three minutes for an API to load in a slot, but the API runs/executes quickly, it may be beneficial to reuse the same slot that has the API pre-loaded with new input or data from one or more subsequent API requests. In other words, the API is loaded in a slot, and new data/input for the API is loaded into the same slot after the first API request was sent.
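
The trade-off behind that queuing decision reduces to a single comparison. A minimal sketch, assuming both durations come from historical measurements such as those kept in the scheduling database:

    def should_queue_behind(slot_load_seconds: float,
                            expected_wait_seconds: float) -> bool:
        """Queue behind a slot that already has the API loaded when waiting
        is cheaper than spinning up a new slot and loading the API again."""
        return expected_wait_seconds < slot_load_seconds

    # An API that takes three minutes to load but runs in seconds: waiting
    # roughly ten seconds beats paying the 180-second load cost again.
    assert should_queue_behind(slot_load_seconds=180.0, expected_wait_seconds=10.0)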

If all slots are currently used and no new API requests can be placed anywhere, the new API requests 211 a-b are put in a queue. This information is put into the scheduler 212, and queued requests may be prioritized into a score (this can happen before the scheduler or at the scheduler). In some example embodiments, if a slot cannot be found, the request is put into the queue, and a slot is later emptied when a different request finishes. Before the slot is emptied of the API, the API server 206 and/or the worker daemon 220 may be configured to review queued requests to determine if queued requests could use the loaded slot and, if so, the VM 208 can pull from the queue to be put in that slot. In some such examples, the API server and/or the worker daemon may be configured to check for additional information in the queues, such as whether the request is from the same user or a different user. This may be considered separately from or in conjunction with the ranking/scoring of slots and may be based on evaluating the queue based on loaded APIs, for example.

If the queue (not shown) starts filling up, more workers, such as more virtual machines, may be added on, initialized, or spun up. An autoscaler 207 may be a component of or operationally integrated with the scheduler 212; for example, the autoscaler may be a process or daemon of the API server 206 configured to be triggered to launch new VMs, add new slots to existing VMs, and/or designate more capacity to a VM in a GPU pool (see FIG. 3). In example embodiments when the scheduler 212 is unable to perform the function, for example when the scheduler is backed-up/lagging and unable to process the amount of work and/or requests received, the autoscaler 207 may perform the same or similar functions to assist the overworked scheduler.

In further example embodiments, the autoscaler 207 may be configured to destroy or tear down virtual machines when one or more VMs are no longer needed. A VM may no longer be needed due to lack of incoming requests, time of day, use of historical data (e.g., daily or weekly patterns based on previous numbers of workers and requests), use of neural networks to predict and/or determine capacity needed, or other reasons. In many embodiments, the autoscaler 207, alone or in combination with other modules such as the scheduler 212, may provide for extra capacity to handle additional requests, such as incremental or sudden increases in request volume. The autoscaler 207 may be configured to predict, attempt to predict, or forecast how long (e.g., time) and how many (e.g., numbers/amounts of) requests in the queue may be acknowledged before the API requests time out. In further example embodiments, the autoscaler 207 may be configured to maintain a pool or series of stopped virtual machines to be started on an as-needed basis without further consideration of slots.
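
One simple autoscaling policy consistent with the description above sizes the worker pool to the queue depth; the slots_per_vm and min_vms parameters are assumptions for the sketch, and a real policy might also weigh historical daily patterns or forecasts.

    def desired_worker_count(queued_requests: int,
                             slots_per_vm: int,
                             min_vms: int = 1) -> int:
        """Target number of VMs: enough total slots to drain the queue."""
        needed = -(-queued_requests // slots_per_vm)  # ceiling division
        return max(needed, min_vms)

    # 17 queued requests at 4 slots per VM call for 5 workers.
    assert desired_worker_count(17, slots_per_vm=4) == 5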

In some example embodiments, a level of fairness is added to the determinations by randomizing the queue. For example, the queue (e.g., a queue in the scheduler 212 or API server 206) may be randomly reordered to ensure a burst of calls (e.g., API requests) from a first user does not starve a second user making a single call. In other embodiments, the score may be manipulated by a randomness determination. For example, randomness may be assigned to a slot score to ensure the same slot or same VM is not always selected, to ensure the scores between slots are not always decided in the same manner, to ensure VMs with many empty slots are not always left empty, etc. Slot scores may further be adjusted with the ability to consider some or all of the variables and introducing information about the VMs and/or slots so as to spread the workload over different available resources.
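
Both fairness mechanisms are short in code form. A minimal Python sketch, with the jitter magnitude as an assumed tuning parameter:

    import random

    def fair_order(queued_requests):
        """Randomly reorder queued requests so a burst of calls from one
        user cannot starve another user making a single call."""
        shuffled = list(queued_requests)
        random.shuffle(shuffled)
        return shuffled

    def jittered_score(base_score: float, jitter: float = 0.1) -> float:
        """Add a small random term to a slot score so the same slot or VM
        is not always selected when scores are nearly tied."""
        return base_score + random.uniform(-jitter, jitter)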

In other example embodiments, if a virtual machine or host (being connected and/or operably connected to a GPU 215, a CPU 217, a random access memory (RAM) 219, and/or other resources/hardware 223) does not have X amount of memory available at that moment, the scheduler module may be configured to return the API request back to the API server. The API server will transmit the API request to another VM. At or around the same time as the request transfer, the API server will update the scheduling database regarding available memory. The API server can further transmit a response to the user of the API request explaining the transfer of the API request.

FIG. 3 illustrates an example computing environment 300 in accordance with some embodiments of the subject technology.

A processing server, such as the API server 306, is operably interconnected to one or more worker pools 307 and 309 and configured to receive processing requests directly from clients 302 a-b or via one or more load balancers 304. The worker pools consist of GPU- and CPU-enabled virtual machines 307 and CPU-only enabled virtual machines 309. A first processing request 311 a is received at the processing server 306, which determines that the request 311 a requires or would benefit from the use of a GPU-enabled virtual machine. As such, the processing request 311 a is assigned to VM 308 a by one of the processing servers 306 receiving the request. When a processing request is received at a processing server, the request contains information as to what the request might need. In some example embodiments, the request received directly from the client 302 a may only include the process and input data to be run. Once the process request is received at the processing server, the processing server can determine how much memory the request is likely to take, how much GPU memory could be needed, if the request requires a GPU, if the request requires any special or different requirements, or the like.

The processing server 306 is further configured to determine a worker pool, such as a pool of virtual machines, which may be used to process a request 311 b that only requires or may be suited for CPU-only enabled VMs, such as VMs 308 c-d. The worker pools may include CPU-only workers, CPU and GPU workers, or virtual machines with other characteristics. Worker pools may be created as more virtual machines are needed based at least in part on incoming request load. Worker pools may include any number of virtual machines and may be located in different regions and data centers around the world. Worker pools may further consist of physical machines, virtual machines, a combination of virtual and physical machines, or the like.

Once a worker pool is selected, the processing server 306 assesses each virtual machine 308 a-d in the selected worker pool 307 and/or 309 to determine if the request can be assigned to a specific VM. The processing server 306 may be configured to use information stored in the monitoring database and/or a scheduling database (see FIGS. 1 and 4) as necessary, required, and/or considered. Once a VM is selected, the processing server 306 further determines a slot on the chosen VM to which to assign the request. In common parlance this may be referred to as receiving work (e.g., a job, request, message, etc.) into a slot, pod, or other container. In alternative example embodiments, the processing server 306 may not select a specific or exact VM and/or slot, but instead place the request in a queue. In such embodiments, an available VM and/or an available slot may retrieve the request without having a predetermined location.

Multiple types of pools, such as pools for CPUs, pools for GPUs, or pools for other types of hardware currently known or hereinafter used in the art, may be maintained for scheduling purposes and/or for use with the scheduler. Pools may be divided based on hardware type, region, size limits, or other constraints.

FIG. 4 conceptually illustrates virtual machine components 400, such as the virtual machine shown in FIGS. 1-3, in accordance with some embodiments of the subject technology.

In one example embodiment, a virtual machine (VM) 408 with access to a GPU may be organized with one or more slots, such as job slots 414 a-d, that are configured to specify the number of processes, for example the number of API requests, that are allowed or able to run concurrently on a specific virtual machine. A VM generally cannot run more concurrent jobs than it has slots. The size of a slot may be defined as the memory, CPU, and/or GPU resource reservation requirements for the specific virtual machine. The slots may be a logical representation of the GPU memory made available to the virtual machine. A slot creator module 416 may divide each virtual machine or host machine into a fixed or variable number of slots and assign requests to available free slots (for example, the scheduler may assign API requests to free slots based on a round-robin algorithm, lowest slot number first, first-in-first-out queue consideration, or other allocation schemes).
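
A sketch of such a slot creator follows, dividing a VM's GPU memory into a fixed number of equal slots and handing out free slots in round-robin fashion; the class and method names are hypothetical.

    from collections import deque

    class SlotCreator:
        """Divides a VM's GPU memory into equal slots; assigns round-robin."""

        def __init__(self, gpu_memory_gb: float, slot_count: int):
            self.slot_size_gb = gpu_memory_gb / slot_count
            self.free_slots = deque(range(slot_count))
            self.assigned = {}  # slot index -> request id

        def assign(self, request_id):
            if not self.free_slots:
                return None  # caller queues or reschedules the request
            slot = self.free_slots.popleft()
            self.assigned[slot] = request_id
            return slot

        def release(self, slot):
            del self.assigned[slot]
            self.free_slots.append(slot)  # back of the line: round-robin reuse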

In one example embodiment, a worker daemon 420, which may be a daemon or other process, is configured on each virtual machine 408 running a software program or API, and is used to schedule work based, at least in part, on GPU memory requirements. The worker daemon 420 or other process is configured to track available memory at a GPU; the worker daemon may determine the currently available memory at the time when a user request, such as an API request, is received. The worker daemon 420 for each VM 408 monitors the slots 414 a-d on that VM. The worker daemon polls the slots of its VM to determine if the accessible GPU maintains enough available memory for the API request in the assigned slot to be run.

The worker daemon 420 may further be configured to determine an approximate amount of available memory at or around the time an API request is received. The worker daemon 420 may further be configured to determine that the API request may be scheduled when there is enough free GPU memory. If there is not enough free GPU memory, for example, if there are no available or free slots, the worker daemon may determine if memory may be made available by evicting an API loaded in a slot or otherwise occupying (e.g., loading, running, working, etc.) the slot.

The worker daemon 420 may include a metering daemon 418 to receive and record user logs, to determine expected memory usage, and to adjust the expected value, such as the expected amount of GPU memory usage. In alternative example embodiments, the scheduler module (not shown) may be configured to determine and adjust the expected value and record the same in a database. The metering daemon 418 may be configured to update a database 410 so that information is up-to-date for the next API request. Expected resource usage could also be an expected runtime, which would allow for queuing API requests for the same APIs already loaded in a specific slot. For example, expected usage could be A amount of memory, B amount of time, and/or C percentage of memory. The worker daemon 420 may further be configured to update the database 410 constantly, intermittently, or at another determined or random interval to determine the status of a GPU and/or GPU resources for the specific VM associated with the worker daemon.
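
One way the metering daemon could adjust the expected value is an exponential moving average over observed runs; the weight below is an assumed tuning parameter, not a value from any embodiment.

    def update_expected_memory(previous_expected_gb: float,
                               observed_gb: float,
                               weight: float = 0.2) -> float:
        """Blend a new observation into the expected GPU memory usage."""
        return (1.0 - weight) * previous_expected_gb + weight * observed_gb

    # Three runs of an API whose usage drifts upward from an initial 2 GB
    # estimate; the result would be written back to the database 410.
    expected = 2.0
    for observed in (2.5, 3.0, 3.0):
        expected = update_expected_memory(expected, observed)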

The worker daemon 420 may determine if the unavailable slots, such as slot 3 (414 c) and/or slot 4 (414 d), may be evicted to provide additional space and, if so, use the evicted space to begin the API request. If no available slots are found, and no slots can be evicted to provide additional space, the worker daemon can return the API request to a work queue, such as a queue backed by RabbitMQ, Kafka, a SQL database, or the like, to allow the API server to reschedule the API request to another VM. The worker daemon 420 may further return the API request to a queue in an in-memory database, a message queuing system, or other similar constructs. In alternative example embodiments, the scheduling may be performed by the API server instead of the worker daemon or scheduler.

In some example embodiments, an API may maintain a model file. The model file may store a learned portion of the API. For example, in many cases, a machine learning algorithm consists of two parts: a first part is the algorithm itself, and a second part is the model that is learned from the data being processed by the API. Consider, for example, a single neural network trained on different data sets or different input, where one model learns to recognize nudity (for use by a nudity detection API) and a second model learns to recognize color (for use by a color detection API). The model file may store at least the learned part (e.g., processed portion or partially processed portion) of the API even though the neural network itself is the same for both APIs. In some example embodiments a model file may be bundled or incorporated with the API code itself. In alternative example embodiments, the model file could be stored in an external database that is configured to be accessible by the API via the worker daemon or a component thereof, such as database 410.

In some example embodiments, an API may be configured to query, contact, or otherwise interact directly with the database 410 to retrieve available data or information related to the API. The worker daemon 420 may further be configured to perform as a proxy for data, such as a model file, which may include handling authorization for access to the data, local caching of the data, logging of data requests, and the like. A scheduler, such as the scheduler module 112 of the API server 106 as depicted and described in connection with FIG. 1, may be configured to account for usage patterns of APIs. For example, if an API always requires access to the same data or model file, the scheduler may be configured to execute that API on a VM that has the data cached and/or available.

Returning to FIG. 4, the slot creator 416 may be a daemon configured to monitor the status of API requests being processed in the specific VM 408. For example, the slot creator 416 can be configured to monitor if the APIs are loading, loaded, running, idle, terminating, etc. in their assigned slots. The slot creator 416 can further determine if an API request being run is going to fail partway through its processing. If so, the worker daemon 420 may pause the current run of the API request and transfer the partially-processed data and the API request to a different VM that has the available memory for the API request to be continued and completed, instead of restarting the API request from the beginning.

Alternative example embodiments may include a scheduler module transmitting a saved state of the processed API request to be serialized to a different slot on a different GPU VM to continue working on the saved state. For example, for an API that includes a stateless algorithm, process, or API, e.g., an API or process that does not make network calls, a snapshot of the memory can be transmitted to the different GPU VM.

In alternative example embodiments, a VM may not have a predetermined number of slots but may allocate slots or resources dynamically. In other words, slots or containers may not be used on the host or virtual machine at all or in part. The VMs may be configured to query the queue, such as a queue in the API server 106 or the scheduler 112 as described and depicted in connection with FIG. 1, to retrieve queued processes, jobs, and API requests (work) as one or more VMs have resources available. For example, a VM that has available resources will contact a central work queue to receive new or updated work.

Returning to FIG. 4, as is depicted in some example embodiments, slots are not a fixed construct; as such, the scheduler creates and terminates slots as needed. However, using the scheduler as a queue to maintain a list of requests enables the worker daemon 420 to retrieve requests from the queue without determining if slots are available and/or without requiring an API server or scheduler to initialize a specific amount of resources, e.g., a slot, on the VM.

FIG. 5A is an illustrative example of a state diagram 500A that specifies a few of the different possible states of slots in a host in accordance with at least one embodiment shown.

Slots, including persistent slots, may have a varying number of states, such as empty, first loading, second loading, loaded, standby, running, evicted, terminated, or more. Each of these states can use different amounts of GPU memory, and this is a variable that is maintained by the worker daemon 120, or another process such as the API server 106 or scheduler 112 as depicted and described in connection with FIG. 1, so that the slots available on that VM can be changed/varied.

Returning to FIG. 5A, at the INITIALIZING state (501 a), the process of entering an API into a slot of a virtual machine begins, which transitions to the EMPTY state (502 a) upon determination that there is an available slot. Post creation of the slot, the LOADING 1 state (503 a) uploads some or all of the initialized/requested API. An empty slot state, such as slots 1 or 2 (414 a-b) as depicted in FIG. 4, may represent a slot that is available where no API or process is loaded or, in alternative examples, where no API is running. Returning to FIG. 5A, a first loading slot state may represent a slot that has been claimed for a load request and for which a slot is being created.

The status of the slot changes from the LOADING 1 state (503 a) to the LOADING 2 state (504 a) in some embodiments where the runner (executor code) determines additional data, code, and/or other information is necessary, required, requested, and/or considered before completing the loading process of the API. In other embodiments, where the runner determines no additional information is necessary, required, or the like before completing the loading process of the API, the LOADING 1 state (503 a) changes to a LOADED state (505 a).

Once the API is effectively loaded and being executed, the LOADED state (505 a) may persist until either the API is terminated or the API is running. If the API is terminated, the LOADED state (505 a) changes to a TERMINATING state (507 a). If the API is running, the LOADED state (505 a) changes to a RUNNING state (506 a).

When an API or process is still running in a slot, but no new information or data is being received in the slot, the RUNNING state (506 a) changes to a STANDBY state (508 a). For example, a standby slot state may indicate a state that still has a job running in the slot. The indication may mean that the entire VM is scheduled to be shut down and all the slots running on the VM, or just a specific slot running on the VM, are to be terminated, but work is still being processed or completed. The standby state may represent that a worker is preparing for a shutdown, and new work is no longer assigned to that slot while work is being drained. The STANDBY state (508 a) changes to a TERMINATING state (507 a) when the VM is preparing to be shut down. For example, a terminated slot state may indicate a state that is finished with the running of the API request, where the API is unloaded from the slot and is not in memory, though other slots on the same machine are still in standby, active, or other states such that the entire VM cannot be terminated.

In alternative example embodiments, a runner (e.g., executor) is used to abstract between different programming languages and interconnects the code to be run with the rest of the platform. For example, the runner (or executor code) handles communication into and out of one or more slots. A first loading slot state may represent a slot that has been claimed for a load request and for which a slot is being created. This may indicate that the runner has not been announced and may therefore not have received the actual load request yet. A second loading state may represent a runner that has been initialized and been sent an actual load request. This may indicate that the slot is starting to run an API or process load.
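
The slot lifecycle described above can be expressed as a small state machine. The following sketch encodes only the transitions that are explicit in this description; FIG. 5A may contain transitions (for example, involving the evicted state) that are not reconstructed here.

    from enum import Enum, auto

    class SlotState(Enum):
        INITIALIZING = auto()
        EMPTY = auto()
        LOADING_1 = auto()  # slot claimed for a load; runner not yet announced
        LOADING_2 = auto()  # runner initialized and sent the actual load request
        LOADED = auto()
        RUNNING = auto()
        STANDBY = auto()
        TERMINATING = auto()

    ALLOWED = {
        SlotState.INITIALIZING: {SlotState.EMPTY},
        SlotState.EMPTY: {SlotState.LOADING_1},
        SlotState.LOADING_1: {SlotState.LOADING_2, SlotState.LOADED},
        SlotState.LOADING_2: {SlotState.LOADED},
        SlotState.LOADED: {SlotState.RUNNING, SlotState.TERMINATING},
        SlotState.RUNNING: {SlotState.STANDBY},
        SlotState.STANDBY: {SlotState.TERMINATING},
        SlotState.TERMINATING: set(),
    }

    def advance(current: SlotState, target: SlotState) -> SlotState:
        if target not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current.name} -> {target.name}")
        return target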

In alternative example embodiments, such as those alternative embodiments described in connection with FIG. 4 above, the states depicted in FIG. 5A may not be associated with specific slots but may otherwise correlate to an amount of free resources, such as GPU memory, available on a given virtual machine or host. The available resources may be determined according to the state diagram; however, the need for or use of slots is no longer dependent upon the states.

FIG. 5B is an illustrative example of a graph 500B showing GPU memory 515 b over time 510 b in a host machine in accordance with at least one embodiment.

To schedule resources, it is generally necessary to know how many resources a given request will need. In some example embodiments, historical data can be used to determine exactly or approximately how many resources are needed. For example, an API request can be expected to use X amount of GPU memory 520 b before the request runs. For example, when an API request runs or executes, the amount of memory used is tracked historically; this information may be used to assume the same or a similar amount of memory may be used or needed when scheduling new or future work.

For example, a user, such as an API developer, has an API that runs quickly but has a large load time. The user has published this API privately and uses it for an application that the user's customers call directly. The user has noticed that, at seemingly random times, the API calls take dramatically longer than others. To alleviate this problem, the user can create and use a persistent slot for that API. This will make sure that there are always one or more slots for that caller with the API the user wants to run. If the time between subsequent calls to the API is shorter than the time it takes to complete the call, the calls to the API from this user may never experience the “cold start” problem (e.g., an empty slot without data, input, processes, API, etc.).

When a program or process is executed, there can be two phases: a first phase that may include loading before input data is received or needed, and a second phase that may include actual processing of the request, including the input data. When the loading of an API, such as a neural network for example, takes a long time, a user may want to maintain the neural network loaded for as long as possible and for as many requests as possible so there is no need to pay the loading cost (e.g., money, time, resources, etc.) multiple times. In other scenarios, when the loading of the neural network is fast/quick, the API can be unloaded and reloaded as needed without incurring as much cost. For example, preemptive APIs may be loaded by the scheduler in response to historical usage, forecasting data, heuristics, and/or other information.

For persistent slots, and regularly scheduled slots, a load request can be separate from an API request, such that a user can trigger a load request for a persistent slot without actually transmitting an API request. For example, a user can request to load an API, algorithm, or other process; at that point the load request begins to run and is then completed, and the API, algorithm, or other process is maintained in a loaded state in the persistent slot. If the persistent slot is to be ended, as in the container is to be terminated, the API or algorithm is evicted from the persistent slot.

FIG. 6 provides an illustrative example of the process 600, which may be used to schedule a user's request to an available host based, at least in part, on available GPU resources. The process 600 may be performed by a suitable system or component of a system described herein, such as the API server 106, the scheduler 112, and/or the worker daemon 120 described above in connection with FIG. 1.

Returning to FIG. 6, the process 600 includes receiving a request from a user via a user's client (602). The request may include any request to perform calculations, complex applications, or other processes, such as an application programming interface (API) request, to be processed using a GPU. The process 600 further includes scheduling the received request to a virtual machine (604). For example, the process 600 may receive an API request such as the API request 211 a as described above and depicted in connection with FIG. 2.

Returning to FIG. 6, as a result of the scheduling of the API request, the process 600 further includes determining if one or more GPUs operably connected to the scheduled virtual machine contain enough available memory to load and/or execute the API request (606). If the GPUs being connected to or accessible by the scheduled virtual machine maintain the required available memory, the process 600 further includes assigning the API request to a slot or container of the virtual machine (608). In some example embodiments provided herein, the API request may be assigned or scheduled to a VM according to a queuing system or process, as described in connection with FIGS. 3 and/or 4 above.

Returning to FIG. 6, if the GPU associated with the scheduled virtual machine does not have enough memory available for the received API request, the process 600 includes determining if there are any presently loaded APIs that may be evicted and/or terminated (610). For example, a previously received API request may have been scheduled into a slot on the same virtual machine, where the loaded API could be removed, evicted, and/or terminated in order to provide resources for the received request. If it is determined that there are loaded API(s) that may be evicted, the process 600 further includes evicting the determined API (612) and redetermining if there is GPU memory available for the received API request (606).

If it is determined that there are no loaded API(s) that can be evicted from active slots, the process 600 further includes rejecting the API request or failing the API request (614). In response to a rejected or failed API request, the process 600 includes reporting the failed or rejected API request to the user (616). According to some example embodiments, the user may not receive a failed or rejected API request response.
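
The loop formed by steps 606-612 can be sketched compactly; the StubVM below is a stand-in for the worker daemon's view of a VM, and its eviction choice is deliberately naive.

    class StubVM:
        """Minimal stand-in for a GPU-enabled VM, for illustration only."""

        def __init__(self, free_gb, loaded):
            self.free_gb = free_gb
            self.loaded = dict(loaded)  # API name -> GPU memory it holds

        def evictable_api(self):
            # Pick any loaded API; an embodiment would consult slot states.
            return next(iter(self.loaded), None)

        def evict(self, api):
            self.free_gb += self.loaded.pop(api)

    def handle_request(needed_gb, vm):
        """FIG. 6 flow: place the request, evicting loaded APIs as needed
        (606-612); reject only when nothing can be evicted (614, 616)."""
        while vm.free_gb < needed_gb:      # step 606
            victim = vm.evictable_api()    # step 610
            if victim is None:
                return "rejected"          # steps 614 and 616
            vm.evict(victim)               # step 612
        vm.free_gb -= needed_gb            # step 608: assign to a slot
        return "assigned"

    vm = StubVM(free_gb=2.0, loaded={"color-detector": 3.0})
    assert handle_request(4.0, vm) == "assigned"  # eviction freed enough memory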

FIG. 7 shows an illustrative example of the process 700, which may be used to create and/or schedule one or more persistent slots on one or more virtual machines in accordance with at least one embodiment. The process 700 may be performed by any suitable system, such as by a web server, such as the web server 1004 as illustrated and described in connection with FIG. 10, or any component operably connected thereto.

Returning to FIG. 7, the process 700 includes making available the option for a user to add a persistent slot for the user's continued or prolonged use of a specific space, such as a specific amount of resources or a specific amount of GPU memory associated with a virtual machine (702). In some example embodiments, the user may add a persistent slot to the user's profile, where said profile may be stored in one or more databases, such as the scheduling database 125 and/or the database 410 as described and depicted in connection with FIGS. 1 and 4, respectively. One or more databases may contain, include, record, and/or otherwise provide for information associated with a user, a client, an API request, an API, or other authentication and application details generally understood in the art. The one or more databases may further be controlled/managed by one or more services provided by or via a cloud service provider, such as the data storage service 934, the authentication service 935, and/or the metering service 936, as described and depicted in connection with FIG. 9 below. In some example embodiments, the one or more databases may further be associated with one or more database servers, such as the database servers 1006 as described and depicted in connection with FIG. 10 below.

Returning to FIG. 7, the process 700 further includes providing a list of current persistent slots paid for and the status of current persistent slots to the user (704). In response to a user's request transmitted through an application executed by the user's client, such as a user's computing device, the process 700 further includes providing for the deletion and/or disabling of a persistent slot from the database (716) and providing for the addition of a persistent slot to the database (706). In response to a request to add a persistent slot, the process 700 includes providing for the selection, if applicable, of a version of the API to be added, loaded, and/or stored in the persistent slot (708). The process 700 further includes determining if the selected version of the API calls or executes additional or other APIs or processes (712).

If the process 700 determines that additional APIs are called, e.g., if the requested API uses one or more child processes to execute, the process 700 further includes suggesting the addition or creation of a persistent slot for any dependent processes or APIs (714). In at least some example embodiments, the process 700 further includes creating a persistent slot for the requested API, which may include adding a row to a database to indicate the persistent slot and its associated status (710).

Some or all of the processes depicted in FIGS. 6 and 7, including process 600 and process 700, may be performed under the control of or according to the instructions of one or more computer systems configured with executable instructions. The processes 600 and 700 may be implemented as code (e.g., executable instructions, computer program(s), or application(s)) executing on one or more processors, by hardware, or combinations thereof. The code may be stored on a non-transitory computer-readable storage medium.

FIG. 8 is a block diagram 800 illustrating exemplary components of a sample graphics processing unit (GPU) in accordance with an embodiment. The depicted example GPU 815 may include different components or subsystems operably interconnected. The components of example GPU 815 include a bus interface 881, a graphics memory controller (GMC) 885, a compression unit 886, a graphics and compute array 887, a power management unit 882, a video processing unit 883, and a display interface 884. The display interface 884 may, for example, be operably interconnected to a host machine, such as a physical machine and/or a virtual machine operating according to a virtualization layer on a host machine. The GPU provided for in example embodiments enables more intensive processes to be loaded, executed, and completed in a manner that is more effective, more resource enhancing, and/or more likely to succeed than the same or similar processes being executed or attempted on a CPU. The example embodiment of a GPU in FIG. 8 depicts and describes components of the specialized computing machine, which may be designed in other ways according to GPUs (virtual, physical, and/or a combination of virtual and physical) currently known or hereinafter considered in the relevant art.

FIG. 9 depicts one example 900 of a client 902 connected to a cloud service provider (CSP) 909 according to one embodiment. The CSP 909 may provide an assortment of services to a user through the user's client 902 over a network 903 via an interface 931, where the network may be any type of network. The interface 931 may be one or more web service interfaces, where each of the services 932-937 may include its own user interface. The user may communicate with the CSP 909 via a network 903, such as the Internet, to cause the operation of one or more embodiments described herein. The CSP 909 services provided to users according to this example include a virtual computing service 932, a scheduling service 933, a data storage service 934, an authentication service 935, a metering service 936, and one or more other services 937. The one or more services may further be operably interconnected with one or more other services provided by the CSP 909, such as being connected via one or more interfaces. Not all embodiments described herein include the services 932-937, and additional and/or alternative services may be provided.

The virtual computing service 932 may be a collection of computing resources configured to instantiate one or more virtual machines for use by the user. The user may communicate with the virtual computing service 932 to operate the virtual machines initiated on physical computing devices. In other example embodiments, other computer systems or system services may be employed that do not use virtualization and/or that provision applicable computing resources on one or more dedicated physical devices, such as a web server or application server.

In one example embodiment, the scheduling service 933 may be a collection of computing resources to schedule requests to available resources. For example, scheduling may be provided for a virtual machine (VM) with access to a GPU, which may be organized with one or more slots, such as job slots, that are configured to specify the number of processes, for example the number of API requests, that are allowed or able to run concurrently on a specific virtual machine. A VM generally cannot run more concurrent jobs than it has slots. The size of a slot may be defined as the memory, CPU, and/or GPU resource reservation requirements for the specific virtual machine. The slots may be a logical representation of the GPU memory made available to the virtual machine. A scheduler module may divide each virtual machine or host machine into a fixed or variable number of slots and assign requests to available free slots (for example, the scheduler may assign API requests to free slots based on a round-robin algorithm, lowest slot number first, first-in-first-out queue consideration, or other allocation schemes).

The CSP 909 of FIG. 9 further includes the data storage service 934, which may be implemented to synchronously and/or asynchronously process requests to access, store, create, or otherwise affect data. For example, the data storage service 934 may be configured to allow access and retrieval of data associated with a user, a client, a request, an API, or another object described herein, so that data may be provided in response to requests.
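
As a further non-limiting sketch, a data storage service such as service 934 could expose the same operations both synchronously and asynchronously; the in-memory dictionary and the handle_request name below are hypothetical stand-ins for whatever durable backend an embodiment actually uses.

```python
import asyncio

class DataStore:
    """Hypothetical sketch only: processes requests that access, store, or affect data."""

    def __init__(self):
        self._data = {}  # stand-in for a durable storage backend

    def handle_request(self, op: str, key: str, value=None):
        """Synchronous path: access, store, or otherwise affect data."""
        if op == "store":
            self._data[key] = value
            return value
        if op == "access":
            return self._data.get(key)
        raise ValueError(f"unsupported operation: {op}")

    async def handle_request_async(self, op: str, key: str, value=None):
        """Asynchronous path: runs the blocking call in a worker thread
        so the caller's event loop is not blocked."""
        return await asyncio.to_thread(self.handle_request, op, key, value)
```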

The authentication service 935 may be one or more computing resources that are configured to perform actions for authenticating a user. The metering service 936 may provide for users to submit requests related to the management of their user accounts, such as requests to add, delete, or modify account information, processing information, or other policy information. The cloud service provider 909 may further maintain one or more other services 937 based at least in part on the needs or wants of the provider 909, the network 903, the client 902, or other customer/user requests.

FIG. 10 illustrates aspects of an example environment 1000 for implementing aspects in accordance with various embodiments. While a web-based environment is used for purposes of explanation and depiction, those of skill in the art will recognize that different environments may be used to implement different embodiments provided herein. The environment 1000 includes a client device 1002, which can include any appropriate device operable to send and/or receive requests, messages, or other information, such as a personal computer, computer system, or computing device, via a network 1003, such as the Internet, a local area network, or another network type. These devices also can include other electronic devices, such as dummy terminals, thin clients, gaming systems, and other devices capable of communicating via a network, as well as virtual devices such as virtual machines, hypervisors, and other virtual devices capable of communicating via a network.

The example environment 1000 includes at least one web server 1004, at least one application server 1005, and at least one database server 1006, each or all of which may include several servers, layers, processes, and/or other components configured to interact according to example embodiments presented herein. The servers 1004-1006 may be implemented in various ways, such as hardware devices or virtual computing systems. In some contexts, the term “server” may refer to a programming module being executed on a computing system. The database server(s) 1006 may include a device or combination of devices capable of storing, accessing, and retrieving data, and/or may include any combination of servers, databases, and storage devices in any standard, distributed, virtual, clustered, or otherwise organized environment.

The application server(s) 1005 may include any and all applicable software, hardware, and firmware for integrating with the database server(s) as needed to execute aspects of one or more applications and/or embodiments presented herein for the client device 1002. The application server(s) 1005 may provide services alone or in cooperation with the database server(s) and are able to generate content, such as text, graphics, audio, video, and/or other content, to be provided to the user. The management of requests and responses, as well as the delivery of content between the client device 1002 and the application server(s) 1005, may be accomplished by the web server 1004 using appropriate server-side structured or markup languages, such as Python, Ruby, Perl, JAVA®, HTML, XML, or the like.

As will be understood by one of ordinary skill in the art, example embodiments presented herein may not require web and application servers, as structured code discussed herein can be executed on any appropriate device or host machine. In addition, embodiments and processes described herein may be performed collectively by multiple devices, which may form a distributed and/or virtual system.

The database server(s) 1006 are operable to receive instructions from, send instructions to, and otherwise process data in response to instructions from the application server 1005. The application server 1005 may provide static, dynamic, or a combination of static and dynamic data in response to the received instructions. Dynamic data, such as data used in web logs (blogs), shopping applications, news services, and other such applications, may be generated by server-side structured languages as described herein or may be provided by a content management system (“CMS”) operating on, or under the control of, the application server.

Each server may include an operating system that provides executable program instructions for the general administration and operation of that server and typically will include a non-transitory computer-readable storage medium (e.g., a hard disk, random access memory, read-only memory, etc.) storing instructions that, when executed by a processor of the server (e.g., a CPU or GPU), allow the server to perform its intended functions. In some example embodiments, the server may be partitioned into kernels, which use a single operating system that provides executable program instructions. Suitable implementations for the operating system and general functionality of the servers are known or available and are readily implemented by persons having ordinary skill in the art, particularly in light of the disclosure herein.

The depiction of the system 1000 in FIG. 10 should be taken as being illustrative in nature and not limiting to the scope of the disclosure.

Various embodiments presented herein may utilize at least one network that would be familiar to those skilled in the art for supporting communications using any of a variety of commercially available protocols, such as User Datagram Protocol (UDP), Transmission Control Protocol/Internet Protocol (TCP/IP), protocols operating in various layers of the Open System Interconnection (OSI) model, File Transfer Protocol (FTP), and other protocols currently known or hereinafter applicable in the art. The network can be, for example, a local area network, a wide-area network, a virtual private network, the Internet, an intranet, an extranet, another similar type of network, or any combination thereof.

In embodiments utilizing a web server, such as web server 1004, the web server may be configured to run any of a variety of server or mid-tier applications, including Hypertext Transfer Protocol (HTTP) servers, File Transfer Protocol (FTP) servers, Common Gateway Interface (CGI) servers, data servers, Java servers, Apache servers, and business application servers. The server(s) also may be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more web applications that may be implemented as one or more scripts or programs written in any programming language, such as JAVA®, C, C#, or C++, or any scripting language, such as Ruby, PHP, Perl, Python, or TCL, as well as combinations thereof.

The environment can include a variety of data stores and other memory and storage media as discussed above. These can reside in a variety of locations, such as on a storage medium local to (and/or resident in) one or more of the computers, or remote from any or all of the computers across the network. Files necessary for performing the functions attributed to the computers, servers, or other network devices may be stored locally and/or remotely, as appropriate. Where a system includes computerized devices, each such device can include hardware elements that may be electrically coupled via a bus, the elements including, for example, at least one central processing unit (CPU or processor), an input device (e.g., a mouse, keyboard, controller, etc.), and at least one output device (e.g., a display device, printer, etc.). Such a system may also include one or more storage devices, such as disk drives, optical storage devices, and solid-state storage devices such as random access memory (RAM) or read-only memory (ROM), as well as removable media devices, etc.

Such devices also can include a computer-readable storage media reader, a communications device (e.g., a modem, a network card (wireless or wired), etc.), and working memory as described above. The computer-readable storage media reader can be connected with, or configured to receive, a computer-readable storage medium representing remote, local, fixed, and/or removable storage devices, as well as other such devices for temporarily, semi-permanently, or permanently containing, storing, transmitting, and retrieving computer-readable information. Storage media and computer-readable media for containing code, or portions of code, can include any appropriate media known or used in the art, including storage media and communication media, such as, but not limited to, volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage and/or transmission of information such as computer-readable instructions, data structures, program modules, or other data. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the various embodiments.

The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims. Other variations are within the spirit of the present disclosure. Thus, while the disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific form or forms disclosed; on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the embodiments or technology, as defined in the appended claims.

The use of the terms “a,” “an,” “the,” and similar referents in the context of describing the disclosed embodiments (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to”) unless otherwise noted. The term “connected,” when unmodified and referring to physical connections, is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. The terms “operably connected” or “operably interconnected” and the like may refer to virtual and/or physical connections and are to be construed as partially or wholly contained within, attached to, or joined together, even if there are intervening constructs or components. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. The use of the term “set” (e.g., “a set of requests”) or “subset,” unless otherwise noted or contradicted by context, is to be construed as a non-empty collection comprising one or more members. Unless otherwise noted or contradicted by context, the term “subset” of a corresponding set does not necessarily denote a proper subset of the corresponding set; the subset and the corresponding set may be equal. The terms “first” and “second” are generally used to denote one or more objects in a set, and there can be any appropriate number of objects (e.g., a first, a second, a third, . . . an nth, etc.).

Conjunctive language, such as phrases of the form “at least one of A, B, and C,” or “at least one of A, B and C,” unless specifically stated otherwise, is understood with the context as used in general to mean that an item or term may be either A or B or C, or any non-empty subset of the set of A and B and C. Generally, such conjunctive language is not intended to imply that certain embodiments require at least one of A, at least one of B, and at least one of C each to be present.

Operations of processes described herein can be performed in any appropriate order unless otherwise indicated herein or otherwise clearly contradicted by context. Processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or combinations thereof. The code may be stored on a computer-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable storage medium may be non-transitory.

The use of any and all examples, or exemplary language (e.g., “such as”), provided herein is intended to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.

Embodiments of this disclosure are described herein, including the best mode known to the inventors for carrying out the invention. Variations of those embodiments may become apparent to those of ordinary skill in the art upon reading the preceding detailed description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for embodiments of the present disclosure to be practiced otherwise than as specifically described herein. Accordingly, the scope of the present disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the scope of the present disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.

What is claimed is:
1. A computer-implemented method, comprising: under the control of one or more computer systems configured with executable instructions, managing an amount of Graphics Processing Unit (GPU) memory used by one or more processes, wherein the one or more processes directly or indirectly impact one or more other processes running on the GPU; organizing a host machine with access to the GPU according to one or more request slots configured to specify a number of processes that are available to be processed by the GPU; and scheduling the one or more processes based at least in part on the GPU memory.

2. The computer-implemented method of claim 1, wherein the one or more processes include Application Programming Interface (API) requests, complex-interaction calculations, neural networks, artificial intelligence, or other computation-intensive applications.

3. The computer-implemented method of claim 1, wherein the computer-implemented method further includes: optimizing a queue in response to receiving two or more requests at or around a same time; determining an amount of time to process a first request of the two or more requests; determining an amount of time to process a second request of the two or more requests; and ordering the first request and the second request in the queue, wherein the queue is configured to store the two or more requests.

4. The computer-implemented method of claim 1, wherein the computer-implemented method further includes: receiving, from a client device, a request to add a persistent slot; scheduling one process of the one or more processes in the persistent slot; and determining if the one process executes one or more child processes.
5. A system, comprising: at least one computing device configured to implement one or more services, wherein the one or more services are configured to: manage an amount of Graphics Processing Unit (GPU) memory used by one or more processes, wherein the one or more processes directly or indirectly impact one or more other processes running on the GPU; organize a host machine with access to the GPU according to one or more request slots configured to specify a number of processes that are available to be processed by the GPU; and schedule the one or more processes based at least in part on the GPU memory.

6. The system of claim 5, wherein the one or more processes are received from one or more client devices.

7. The system of claim 5, wherein the at least one computing device is further configured to: receive, from a client device, a request to add a persistent slot; schedule one process of the one or more processes in the persistent slot; and determine if the one process executes one or more child processes.

8. The system of claim 5, wherein the at least one computing device is further configured to: optimize a queue in response to receiving two or more requests at or around a same time; determine an amount of time to process a first request of the two or more requests; determine an amount of time to process a second request of the two or more requests; and order the first request and the second request in the queue, wherein the queue is configured to store the two or more requests.

9. The system of claim 8, wherein the at least one computing device is further configured to: receive at least one of information, input, and data associated with the one or more processes; and store the at least one of information, input, and data in a database operably connected to the queue.

10. The system of claim 5, wherein the host machine is a virtual machine operably connected to a GPU.

11. The system of claim 5, wherein the one or more processes include Application Programming Interface (API) requests, complex-interaction calculations, neural networks, artificial intelligence, or other computation-intensive applications.

12. The system of claim 5, wherein the at least one computing device is further configured to: launch a new host machine; create a new slot on an existing host machine; and designate additional resources to the host machine from a pool of host machines operably connected to the GPU.
13. A non-transitory computer-readable storage medium having stored thereon executable instructions that, when executed by one or more processors of a computer system, cause the computer system to at least: receive, from a client device, a request to process data associated with the request; schedule the request to one or more resources of a Graphics Processing Unit (GPU); identify an amount of GPU resources being available to process the request using at least the data associated with the request; and assign the request to the GPU.

14. The non-transitory computer-readable storage medium of claim 13, wherein the instructions further comprise instructions that, when executed by the one or more processors, cause the computer system to provide, to the client device, the processed request.

15. The non-transitory computer-readable storage medium of claim 14, wherein the instructions that cause the computer system to provide the processed request further include instructions that cause the computer system to maintain the request assigned to the GPU after the processed request is provided to the client device.

16. The non-transitory computer-readable storage medium of claim 13, wherein the instructions further comprise instructions that, when executed by the one or more processors, cause the computer system to maintain at least one of the request, the data associated with the request, and the processed request.

17. The non-transitory computer-readable storage medium of claim 16, wherein the instructions further comprise instructions that, when executed by the one or more processors, cause the computer system to: receive one or more additional requests, the one or more additional requests including new data associated with the request; retrieve the request maintained by the computer system; and process the one or more additional requests based on the request.

18. The non-transitory computer-readable storage medium of claim 13, wherein the instructions that cause the computer system to schedule the request further include instructions that, when executed by the one or more processors, cause the computer system to schedule the request in a slot of the GPU, wherein the slot of the GPU is initialized based on the amount of GPU resources being available.

19. The non-transitory computer-readable storage medium of claim 13, wherein the instructions that cause the computer system to identify the amount of GPU resources being available to process the request further include instructions that, when executed by the one or more processors, cause the computer system to determine a status of one or more slots of the GPU, wherein the one or more slots of the GPU are configured to receive at least the request and the data associated with the request in order to process the request.

20. The non-transitory computer-readable storage medium of claim 13, wherein the instructions further comprise instructions that, when executed by the one or more processors, cause the computer system to: determine if one or more other requests are assigned to the GPU; identify the one or more other requests assigned to the GPU to be evicted from the GPU; and evict the one or more other requests from the GPU.